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Abstract 

We  present  a  k  x  d  random  projection  matrix  that  is  applicable  to  vectors  x  €  Rd  in  O(d)  operations 
if  d  >  k2+s .  Here,  k  is  the  minimal  Johnson  Lindenstrauss  dimension  and  5’  is  arbitrarily  small.  The 
projection  succeeds,  with  probability  1  —  1  /n,  in  preserving  vector  lengths,  up  to  distortion  e,  for  all 
vectors  such  that  <  ||x||2fc_1/,2d~'5  (for  arbitrary  small  6).  Sampling  based  approaches  are  either 

not  applicable  in  linear  time  or  require  a  bound  on  ||a;||  that  is  strongly  dependant  on  d.  Our  method 
overcomes  these  shortcomings  by  rapidly  applying  dense  tensor  power  matrices  to  incoming  vectors. 


1  Introduction 

The  application  of  various  random  matrices  has  become  a  common  method  for  accelerating  algorithms  both 
in  theory  and  in  practice.  These  procedures  are  commonly  referred  to  as  random  projections.  The  critical 
property  of  a  A;  x  d  random  projection  matrix,  4>,  is  that  the  mapping  x  i— ►  4>;r  not  only  reduces  the  dimension 
from  d  to  fc,  but  also  preserves  lengths,  up  to  distortion  e,  with  probability  at  least  1  — 1/n  for  some  small  e  and 
large  n.  The  name  random  projections  was  coined  after  the  first  construction  of  Johnson  and  Lindenstrauss 
[1]  in  1984,  who  showed  that  such  mappings  exist  for  k  >  0(log(n) /e2).  Other  constructions  of  random 
projection  matrices  have  been  discovered  since  [2,  3,  4,  5,  6].  Their  properties  make  random  projections  a 
key  player  in  rank-k  approximation  algorithms  [7,  8,  9,  10,  11,  12,  13,  14],  other  algorithms  in  numerical 
linear  algebra  [15,  16,  17],  compressed  sensing  [18],  and  various  other  applications,  e.g,  [19,  20]. 

Considering  the  usefulness  of  random  projections  it  is  natural  to  ask  the  following  question:  what  should 
be  the  structure  of  a  random  projection  matrix,  4>,  such  that  mapping  x  i— >  dhr  would  require  the  least 
amount  of  computation?  A  naive  construction  of  a  k  x  d  unstructured  matrix  <J>  would  result  in  an  O(dk) 
application  cost.  This  is  prohibitive  even  for  moderate  values  of  k  and  d. 
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In  [21],  Ailon  and  Chazelle  proposed  the  first  Fast  Johnson  Lindenstrauss  Transform.  Their  matrix  is 
a  composition  of  a  sparse  sampling  matrix  and  a  discrete  Fourier  matrix.  This  achieves  a  running  time  of 
0(dlog(d)  +  k3).  Recently,  Ailon  and  Liberty  [22]  further  improved  this  to  C^dlog^))1  by  composing  a 
deterministic  code  matrix  and  a  randomized  block  diagonal  matrix.  The  idea  behind  both  fast  constructions 
is  similar:  they  start  with  applying  a  randomized  isometric  d  x  d  matrix  T,  which  maps  all  vectors  in  Rd 
(w.h.p)  into  a  set  %  C  and  then  use  a  k  x  d  matrix  A  to  project  all  vectors  from  \  to  Rfc.  There  seems 
to  be  a  tradeoff  between  the  possible  computational  efficiency  of  applying  A  and  the  size  of  x-  the  smaller 
X  is,  the  faster  A  can  be  applied.  This,  however,  might  require  a  time  costly  preprocessing  application  of  T. 

In  the  present  work  we  examine  the  connection  between  A  and  x  for  any  matrix  A  (Section  2).  We 
propose  in  Section  3  a  new  type  of  fast  applicable  matrices  and  in  Section  4  explore  their  x-  These  matrices 
are  constructed  using  tensor  products  and  can  be  applied  to  any  vector  in  in  linear  time,  i.e,  in  O(d). 
Due  to  the  similarity  in  their  construction  to  Walsh-Hadamard  matrices  and  their  rectangular  shape  we  term 
them  Lean  Walsh  Matrices2 . 


The  rectangular  k  x  d  matrix  A 

Application 

time 

x  €  x  if  IMI2  =  1  and: 

Johnson,  Lindenstrauss  [1] 

k  rows  of  a  random  unitary  matrix 

0(kd) 

Various  Authors  [2,  4,  5,  6] 

i.i.cl  random  entries 

0(dk) 

Ailon,  Chazelle  [21] 

Sparse  Gaussian  entrees 

0(k 3) 

iMioo  <  o{(d/k r1/2) 

Ailon,  Liberty  [22] 

4-wise  independent  Code  matrix 

0(d  log  k) 

W4  <  o(d -1/4) 

This  work 

Any  deterministic  matrix 

? 

IML  <  o(fc_1/2) 

This  work 

Lean  Walsh  Transform 

0{d) 

IMioo  ^  0{k~l/2d~ 5) 

Table  1:  Types  of  k  x  d  matrices  and  the  subsets  x  of  for  which  they  constitute  a  random  projection. 
The  norm  ||  •  ||  4  is  defined  below. 


Due  to  their  construction  the  Lean  Walsh  matrices  are  of  size  d  x  d  where  d  =  da  for  some  0  <  a  <  1. 
In  order  to  reduce  the  dimension  to  k  <  d,  k  =  0(log(n)/e2)),  we  compose  the  lean  Walsh  matrix,  A,  with 
a  known  Johnson  Lindenstrauss  matrix  construction  R.  Applying  R  in  0{d)  requires  some  relation  between 
d,  k  and  a  as  explained  in  subsection  4.1. 

2  Norm  concentration  and  x(^4 ,£,n) 

We  compose  an  arbitrary  deterministic  d  x  d  matrix  A  and  random  sign  diagonal  matrix  Ds  and  study 
the  behavior  of  such  matrices  as  random  projections.  In  order  for  ADS  to  exhibit  the  property  of  a  random 

1Their  method  applies  to  cases  for  which  k  <  r/1/2  ~;)  for  some  arbitrary  small  <5. 

2  The  terms  Lean  Walsh  Transform  or  simply  Lean  Walsh  are  also  used  interchangeably. 
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projection  it  is  enough  for  it  to  preserve  the  length  of  any  single  unit  vector  x  £  Rd  with  very  high  probability: 

Pr  |  ||ADsa;||2  -  1  |  >  e)  <  1/n  (1) 

Here  Ds  is  a  diagonal  matrix  such  that  Ds(i,i)  are  random  signs  (i.i.cl  ±1  w.p  1/2  each)  and  n  is  chosen 
according  to  a  desired  success  probability,  usually  polynomial  in  the  intended  number  of  projected  vectors. 

Note  that  we  can  replace  the  term  ADS x  with  ADxs  where  Dx  is  a  diagonal  matrix  holding  on  the 
diagonal  the  values  of  x,  i.e  Dx(i,i)  =  x(i)  and  similarly  s(i)  =  Ds(i,i ).  Denoting  M  =  ADX,  we  view  the 
term  ||Ms||2  as  a  function  over  the  product  space  {1,  — l}rf  from  which  the  variable  s  is  uniformly  chosen. 
This  function  is  convex  over  [—1,  l]d  and  Lipschitz  bounded.  In  his  book,  Talagrand  [23]  describes  a  strong 
concentration  result  for  such  functions. 

Lemma  2.1  (Talagrand  [23]).  Given  a  matrix  M  and  a  random  vector  s  (s(i)  are  i.i.d  ±1  w.p  1/2/  define 
the  random  variable  Y  =  ||Ms||2.  Denote  by  p  the  median  ofY,  and  by  a  =  ||M||2^2  the  spectral  norm  of 
M.  Then 

Pr[\Y  -  p\>t\<4e~t2/8a2  (2) 

Lemma  2.1  asserts  that  ||HDxs||  is  distributed  like  a  (sub)  Gaussian  around  its  median,  with  standard 
deviation  2a. 

First,  in  order  to  have  E[Y2]  =  1  it  is  necessary  and  sufficient  for  the  columns  of  A  to  be  normalized  to 
1  (or  normalized  in  expectancy).  To  estimate  the  median,  p,  we  substitute  t2  — ■>  t'  and  compute: 

POO 

E[(Y-p)2}  =  /  Pr  {(Y-p)2}>t'}dt' 

Jo 

p  oo 

<  /  4e_t/(8cr2)dt/  =  32cr2 

Jo 

Furthermore,  {E\Y])2  <  E\Y2]  =  1,  and  so  E[{Y  —  p)2)  =  E[Y2}  —  2 pE[Y]  +  p2  >  1  —  2p  +  p2  =  (1  —  p)2 . 
Combining,  |1  —  p\  <  y/32a.  We  set  e  =  t  +  [1  —  p\- 

Pr[| Y  -l\>e\<  4e~£2/32a2  ,for  £  >  2|1  -  p\  (3) 

If  we  set  k  =  331og(n)/£2  (assuming  log(?i)  is  larger  than  some  constant)  the  requirement  of  equation  1  is 
met  for  er  <  /c-1/2.  Moreover  £  >  2 1 1  —  p\.  We  see  that  a  condition  on  cr  =  ||HDa;||2^2  is  sufficient  for  the 
projection  to  succeed  w.h.p.  This  naturally  defines  %. 

Definition  2.1.  For  a  given  matrix  A  we  define  the  vector  pseudonorm  of  x  with  respect  to  A  as  = 

||  ADX || 2 _ >2  where  Dx  is  a  diagonal  matrix  such  that  Dx{i,i)  =  x(i). 

Definition  2.2.  We  define  y(H,  £,n)  as  the  intersection  of  the  Euclidian  unit  sphere  and  a  ball  of  radius 
/c-1/2  in  the  norm  ||  •  ||A 

\(A,  £,  n)  =  jz  e  Rd  |  ||a;||2  =  1,  ||a:||  A  <  fc^1/2|  (4) 

for  k  =  331og(n)/£2. 
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Lemma  2.2.  For  any  column  normalized  matrix,  A,  and  an  i.i.d  random  ±1  diagonal  matrix,  Ds,  the 
following  holds: 


\/x  €  x(A  £i  n)  Pr 

Proof.  For  any  x  £  x>  by  definition  2.2,  ||x||A 
substituting  the  value  of  a  into  equation  3. 


\\\ADsx\\l 


1|  >  e 


<  1/n 


=  \\ADX ||2^2  =  a  <  k  V2. 


(5) 

The  lemma  follows  from 

□ 


It  is  convenient  to  think  about  x  as  the  ’’good”  set  of  vectors  for  which  ADS  is  length  preserving  with 
high  probability.  En  route  to  explore  x{A,  e> n)  f°r  lean  Walsh  matrices  we  first  turn  to  formally  defining 
them. 


3  Lean  Walsh  transforms 

The  Lean  Walsh  Transform,  similar  to  the  Walsh  Transform,  is  a  recursive  tensor  product  matrix.  It  is 
initialized  by  a  constant  seed  matrix,  Ai,  and  constructed  recursively  by  using  Kronecker  products  Ay  = 
A\  ®  Ay_ i.  The  main  difference  is  that  the  Lean  Walsh  seeds  have  fewer  rows  than  columns.  We  formally 
define  them  as  follows: 

Definition  3.1.  Ai  is  a  Lean  Walsh  seed  (or  simply  ’seed')  if  i)  A\  is  a  rectangular  matrix  Ai  £  Crxc, 
such  that  r  <  c;  ii)  A\  is  absolute  valued  1  /  ffr  entree-wise,  i.e,  \Affi,j)\  =  r-1/2;  Hi)  the  rows  of  Ai  are 
orthogonal;  and  iv)  all  inner  products  between  its  different  columns  are  equal  in  absolute  value  to  a  constant 
P  <  l/\/(c-  1).  p  is  called  the  Coherence  of  Ai . 

Definition  3.2.  A e  is  a  Lean  Walsh  transform,  of  order  t,  if  for  all  £’<£  we  have  A'(  =  A\  ®  Ay- 1,  where 
(g)  stands  for  the  Kronecker  product  and  A1  is  a  seed  according  to  definition  3.1. 

The  following  are  examples  of  seed  matrices: 

(l  1  -1  -l\ 

4  =  75  1  -1  1  -1 

V  i  -i  -i  1  / 

r’  =  3 ,d  =  4,  p’  =  1/3 

These  examples  are  a  part  of  a  large  family  of  possible  seeds.  This  family  includes,  amongst  other  construc¬ 
tions,  sub-Hadamard  matrices  (like  A\  j  or  sub-Fourier  matrices  (like  A'().  A  simple  construction  is  given  for 
possible  larger  seeds. 

Fact  3.1.  Let  F  be  the  c  x  c  Discrete  Fourier  matrix  such  that  F(i,j)  =  e27rv/-Hj/c_  Define  A\  to  be  the 
matrix  consisting  of  the  first  r  =  c—  1  rows  of  F  normalized  by  l/y/r.  Ai  is  a  lean  Walsh  seed  with  coherence 
1/r. 


A"  —  J_ 
A1  —  ,/0 


(6) 


=  2,  c"  =  3,  p"  =  1/2 
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Proof.  The  facts  that  \A\(i,j)\  =  1/i/r  and  that  the  rows  of  A\  are  orthogonal  are  trivial.  Moreover,  due  to 
the  orthogonality  of  the  columns  of  F,  the  inner  product  of  two  different  columns  of  A4  must  equal  p  =  1/r 
in  absolute  value. 


(A[jl\A[j2) 


i 

here  F(-,  •)  stands  for  the  complex  conjugate  of  F(-,  •). 


=  -  \-F(c,j1)F(c,j2)\  =  ~ 


(7) 

□ 


We  use  elementary  properties  of  Kronecker  products  to  characterize  At  in  terms  of  the  number  of  rows, 
r,  the  number  of  columns,  c,  and  the  coherence,  p,  of  A\.  The  following  facts  hold  true  for  Af. 

Fact  3.2.  i)  At  is  of  size3  da  x  d,  where  a  =  log(r)/ log(c)  <  1  is  the  skewness  of  A\  ii)  for  all  i  and  j, 
Ae(i,j)  G  id-1/2  which  means  that  At  is  column  normalized;  and  Hi)  the  rows  of  At  are  orthogonal. 

Fact  3.3.  The  time  complexity  of  applying  At  to  any  vector  z  G  is  O(d). 

Proof.  Let  z  =  [z\  \ . . . ;  zc\  where  zt  are  sections  of  length  d/c  of  the  vector  2.  Using  the  recursive  decom¬ 
position  for  At  we  compute  Atz  by  first  summing  over  the  different  Zi  according  to  the  values  of  A\  and 
applying  to  each  sum  the  matrix  At-\.  Denoting  by  T(d)  the  time  to  apply  At  to  z  £  we  get  that 
T(d)  =  rT(d/c )  +  rd.  Due  to  the  Master  Theorem,  and  the  fact  that  r  <  c  we  have  that  T(d)  =  0(d).  More 
precisely,  T(d)  <  dcr/(c  —  r).  □ 


For  clarity,  we  demonstrate  Fact  3.3  for  (equation  6): 


l  Zl\ 


A'ez  —  A't 


z2 

Z3 


\Z4J 


1 

71 


/  A'£_1(zi  +  z2  -  z3  -  z4)  \ 
A't_1(z1  -  z2  +  Z3  -  Z4) 

\  A'p-^zi  —  z2  —  Z3  +  z4)  J 


(8) 


In  what  follows  we  characterize  x(A,  s,  n )  for  a  general  Lean  Walsh  transform  by  the  parameters  of  its  seed, 
r,  c  and  p.  The  omitted  notation,  A ,  stands  for  At  of  the  right  size  to  be  applied  to  x,  i.e,  £  =  log(d)/ log(c). 
Moreover,  we  freely  use  a  to  denote  the  skewness  log(r)/ log(c)  of  the  seed  at  hand. 

3The  size  of  .4 ,  is  rl  X  P .  Since  the  running  time  is  linear,  we  can  always  pad  vectors  to  be  of  length  P  without  effecting 
the  asymptotic  running  time.  From  this  point  on  we  assume  w.l.o.g  d  =  p  for  some  integer  (. 
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4  An  £p  bound  on  |  • 


After  describing  the  lean  Walsh  transforms  we  turn  our  attention  to  exploring  their  ’’good”  sets  \  -  We  remind 
the  reader  that  ||a;||A  <  /c-1/2  entails  x  £  x: 


IML  =  \\adx\\2^2=  “ax  \\yTADxU2 

vMy\\2=1 


=  max  x2(i)(yT  A^)2 

y.l|y||2=1 

(  d  \1/p  f  d  \  1,q 

<  SW’M  E«W 


^2=1 


vM\2=1  “l1 


(9) 

(10) 

(11) 

(12) 


The  transition  from  the  second  to  the  third  line  follows  from  Holder’s  inequality  for  dual  norms  p  and  q, 
satisfying  1/p  +  1/q  =  1.  We  are  now  faced  with  the  computing  ||AT||2^2(}  in  order  to  obtain  the  constraint 
on  \\x\\2p- 

Theorem  4.1.  [Riesz-Thorin]  For  an  arbitrary  matrix  B,  assume  <  C\  and  ||H||P2_>r2  <  (72 

for  some  norm  indices  Pi,ri,p2,r2  such  that  p\  <  r\  and  P2  <  r2.  Let  X  be  a  real  number  in  the  interval 
[0,1],  and  let  p,r  be  such  that  1/p  =  A(l/pi)  +  (1  —  A)(l/p2)  and  1/r  =  A(l/ri)  +  (1  —  A)(l/r2).  Then 
\\B\\p^r  <  CfC'~x. 

In  order  to  use  the  theorem,  let  us  compute  ||AT||2^2  and  ||AT||2^00.  From  ||AT||2^2  =  ||A||2^2  and 
the  orthogonality  of  the  rows  of  A  we  get  that  ||AT||2^2  =  \J d/d  =  d From  the  normalization  of 
the  columns  of  A  we  get  that  ||AT||2^  =  1.  Using  the  theorem  for  A  =  1/q,  for  any  q  >  1,  we  obtain 

||At||2^2(?  <  d/1~°^l‘lq.  It  is  worth  noting  that  ||  AT\\2^2q  might  actually  be  significantly  lower  then  the  given 
bound.  For  a  specific  seed,  A\,  one  should  calculate  \\Af  \\2^2q  an(l  use  \\AJ \\2^2q  =  PfllL2?  t0  achieve  a 
possibly  lower  value  for  ||AT||2^2(?. 

Lemma  4.1.  For  a  lean  Walsh  transform,  A,  we  have  that  for  any  p  >  1  the  following  holds: 

{X  G  R"  I  IMI2  =  1,  IWI2p  <  A-1/2d-^(1-^}  C  x(A,  e, n)  (13) 

where  k  =  0(\og(n)/e2),  a  =  log(r)/log(c),  r  is  the  number  of  rows,  and  c  is  the  number  of  columns  in  the 
seed  of  A. 


Proof.  We  combine  the  above  and  use  the  duality  of  p  and  q: 


VI 

T[ 

M2p\\AT\\2^2q 

(14) 

< 

II  x\\2pdS£ 

(15) 

< 

1 

(16) 

The  desired  property,  ||ar  |A  <  k  1/2,  is  achieved  if  ||ar ||2  <  k  l/2d  2  (1  ^  for  any  p  >  1. 

□ 
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4.1  Controlling  a  and  choosing  R 

We  see  that  increasing  a  is  beneficial  from  the  theoretical  stand  point  since  it  weakens  the  constraint  on 
||a:||  .  However,  the  application  oriented  reader  should  keep  in  mind  that  this  requires  the  use  of  a  larger 
seed,  which  subsequently  increases  the  constant  hiding  in  the  big  O  notation  of  the  running  time. 

Consider  the  seed  constructions  described  in  Fact  3.1  for  which  r  =  c  —  1.  Their  skewness  a  = 
log(r)/ log(c)  approaches  1  as  their  size  increases.  Namely,  for  any  positive  constant  S  there  exists  a  constant 
size  seed  such  that  1  —  25  <  a  <  1. 

Lemma  4.2.  For  any  positive  constant  6  >  0  there  exists  a  Lean  Walsh  matrix,  A,  such  that: 

{x  e  I  IMI2  =  1 ,  IMloo  <  k~1/2d~s}  C  x(Ae,n)  (17) 

Proof.  Generate  A  from  a  seed  such  that  its  skewness  a  =  log(r) /  log(c)  >1  —  2 <5  and  substitute  p  =  oo  into 
the  statement  of  Lemma  4.1.  □ 

The  constant  a  also  determines  the  minimal  dimension  d  (relative  to  k)  for  which  the  projection  can  be 
completed  in  0{d)  operations,  the  reason  being  that  the  vectors  z  =  ADsx  must  be  mapped  from  dimension 
d  (d  =  da)  to  dimension  k  in  O(d)  operations.  This  is  done  using  the  Ailon  and  Liberty  [22]  construction 
serving  as  the  random  projection  matrix  R.  R  is  a  k  x  d  Johnson  Lindenstrauss  projection  matrix  which  can 
be  applied  in  dlog(fc)  operations  if  d  =  da  >  k2+s  for  arbitrary  small  5".  For  the  same  choice  of  a  seed  as 
in  lemma  4.2,  the  condition  becomes  d  >  k2+s  +2S  which  can  be  achieved  by  d  >  k2+s  for  arbitrary  small  S' 
depending  on  6  and  5" .  Therefore  for  such  values  of  d  the  matrix  R  exists  and  requires  0(da  log(fc))  =  0(d) 
operations  to  apply. 

5  Conclusion  and  work  in  progress 

We  have  shown  that  any  k  x  d  (column  normalized)  matrix,  A ,  can  be  composed  with  a  random  diagonal 
matrix  to  constitute  a  random  projection  matrix  for  some  part  of  the  Euclidian  space,  Moreover,  we  have 
given  sufficient  conditions,  on  x  £  for  belonging  to  x  depending  on  different  £ 2  — >  £p  operator  norms  of 
AT  and  lp  norms  of  x.  We  have  also  seen  that  lean  Walsh  matrices  exhibit  both  a  ’’large”  x  and  a  linear 
time  computation  scheme.  These  properties  make  them  good  building  blocks  for  the  purpose  of  random 
projections. 

However,  as  explained  in  the  introduction,  in  order  for  the  projection  to  be  complete,  one  must  design 
a  linear  time  preprocessing  matrix  T  which  maps  all  vectors  in  into  %  (w.h.p).  Achieving  such  T 
would  be  extremely  interesting  from  both  the  theoretical  and  practical  stand  point.  Possible  choices  for  T 
may  include  random  permutations,  various  wavelet/ wavelet-like  transforms,  or  any  other  sparse  orthogonal 
transformation. 

The  authors  would  like  to  thank  Steven  Zucker,  Daniel  Spielman,  and  Yair  Bartal  for  their  insightful 
ideas  and  suggestions. 
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